This week’s problem sheet focuses on the methods for analyzing point pattern and lattice/areal unit data in Sections 4.5-4.7 in the lecture notes. Exercises 1-2 help you with revising the content of the lecture in Week 9, and Tutorial Questions 1 and 2 provide some more advanced questions.
Your answer to the Homework Question can be submitted on Moodle to your tutor for feedback. The submission deadline is 17:00 on Thursday 24 April 2025. You should submit a single PDF or Word file that provides your R code, any created R output and all your comments.
You may want to load the following packages before starting the exercise:
library( dplyr )
library( ggplot2 )
library( tidyr )
library( sf )
library( ggspatial )
library( prettymapr )
library( spatstat )
library( raster )
When working on a University PC, you will have to first install some of these packages.
The file “Outbreaks.csv” contains the data for animal disease outbreaks between September 2015 and August 2017 as provided by the EMPRES Global Animal Disease Information System. Each entry in the data set specifies the location, date and type of disease for an individual outbreak. In this question we want to focus on the African swine fever that has affected large parts of Europe in the past years. Address the following three research tasks / questions and then complete the Moodle quiz.
We start by loading the data frame containing the information about outbreaks:
Outbreaks <- read.csv("Outbreaks.csv")
Since we want to focus on African swine fever, we have to remove all observations on other diseases from the data set:
SwineFever <- filter( Outbreaks, disease == "African swine fever" )
Similar to the operations for data wrangling and text data analysis, we now count how iften each country appears in the data set, and we then print out the top three:
SwineFever %>%
count( country, sort=TRUE) %>%
slice_head( n=3 )
## country n
## 1 Estonia 927
## 2 Latvia 742
## 3 Lithuania 615
We find that the three Baltic states (Estonia, Latvia and Lithuania) had the highest number of outbreaks.
We want to focus on the Baltic states, and thus we have to extract the observations related to these countries from the data
SwineFever_Baltic <- SwineFever %>%
filter( country %in% c( "Estonia", "Latvia", "Lithuania" ) )
We can now create a map using the techniques we covered in the lecture, and which already used for visualizing point-referenced data:
ggplot( SwineFever_Baltic, aes( x=longitude,y=latitude ) ) +
annotation_map_tile( zoom=6 ) + geom_spatial_point( ) +
labs( x="Longitude", y="Latitude" )
The map indicates that most outbreaks are in the eastern half of the Baltic states, with a particularly high numbers in Estonia and Latvia.
We now only focus on Latvia, so we have to extract the relevant data:
SwineFever_Latvia <- SwineFever %>% filter( country == "Latvia" )
Using a shapefile for Latvia, we can construct the ppp object as demonstrated in the lecture notes:
Latvia <- read_sf( "Shapefiles/gadm41_LVA_0.shp" )
Latvia <- Latvia %>%
st_simplify( dTolerance = 1000 ) %>%
st_transform( crs=3857 )
SwineFever_transformed <- SwineFever_Latvia %>%
st_as_sf( coords=c("longitude","latitude"), crs=4326 ) %>%
st_transform( crs=3857 ) %>%
st_coordinates( )
Latvia_ppp <- ppp( SwineFever_transformed[,1],
SwineFever_transformed[,2],
window = as.owin(Latvia) )
Let’s start by deriving the quadrat counting estimate:
par( mai=c(0.01,0.01,0.5,0.01) )
Latvia_Quadrats <- quadratcount( Latvia_ppp, nx=7, ny=5 )
plot( Latvia_Quadrats,
main="Quadrat counting for swine fever outbreaks across Latvia" )
While the plot above already provides sufficient information to answer the question, we also produce the kernel smoothed intensity function estimate fr illustration (in practice we only provide either the quadrat counting or kernel smoothed intensity plot):
par( mai=c(0.01,0.5,0.2,0.5) )
lambdaC <- density.ppp( Latvia_ppp, edge=TRUE )
plot( lambdaC, main="Swine fever outbreaks across Latvia" )
Both plots support our observations from part b) that hardly any outbreaks occur in the West of Latvia, while a high number of outbreaks is in the eastern half, in particular the north-eastern part of Latvia. Due to these large differences, we can conclude that the point process is non-homogeneous.
The file “Crimes France.csv” contains crime statistics from 2015 for all French départments (except Corsica). Specifically, for each department we are provided with the rate of burglaries and violence per 1,000 people. The file uses the UTF-8 encoding and you should ensure to specify this when loading the file. A shapefile for France with the départment boundaries is provided in the file “gadm41_FRA_2.shp”. Use the techniques for areal unit data to complete the following tasks and then go to Moodle and answer the quiz questions.
We start by loading the shapefile:
France <- read_sf( "Shapefiles/gadm41_FRA_2.shp" )
We reduce the complexity of the shapefile using st_simplify():
France <- France %>% st_simplify( dTolerance = 1000 )
We are now ready to plot the shapefile on top of a map:
ggplot( France ) + theme_bw() +
annotation_map_tile( zoom=6 ) +
geom_sf( data=France, alpha=0.7 ) +
labs( x="Longitude", y="Latitude" )
We first load the data file and we have to specify the encoding in this case (otherwise the accents are not imported correctly):
Crime_France <- read.csv( "Crimes France.csv", header=TRUE, fileEncoding = "UTF-8" )
Time to match the department names in the shapefile with that in the data file:
France <- inner_join( x=France, y=Crime_France, by=c("NAME_2"="Department") )
We can now produce the map as seen in the data examples for lattice data in Section 4.7 of the lecture notes:
ggplot( ) +
annotation_map_tile( zoom=6 ) +
geom_sf( data=France, aes(fill=Violence.per.1000.pop) ) + theme_bw() +
scale_fill_distiller( palette="Reds", trans="reverse" ) +
theme( axis.title=element_text(size=15), axis.text=element_text(size=15) ) +
labs( x="Longitude", y="Latitude", color="Violence per 1000 people" )
The map shows that the highest violence rates were observed for Paris. High rates were also recorded along the French Riviera, around Paris, and in the northern départments close to the border with Belgium.
We just need to change the variable we are plotting to create the plot for the rate of burglaries:
ggplot( ) +
annotation_map_tile( zoom=6 ) +
geom_sf( data=France, aes(fill=Burglaries.per.1000.pop) ) + theme_bw() +
scale_fill_distiller( palette="Reds", trans="reverse" ) +
theme( axis.title=element_text(size=15), axis.text=element_text(size=15) ) +
labs( x="Longitude", y="Latitude", color="Violence per 1000 people" )
We find that higher rates of burglaries were observed in southern France, with south-west France and the area around Lyon as the hotspots.
Utopia’s police department needs your help with analyzing their 2015-2021 data regarding certain crimes. The data is provided in the file “UtopiaCrimes.csv”. Note, longitude and latitude coordinates are only available for drug possession offences across District 44.
Utopia consists of 59 districts and a shapefile of Utopia is provided as “UtopiaShapefile.shp”. To hide Utopia’s location, constants have been added to the latitude and longitude coordinates, but the shapes they define are correct. The population for each district is provided in the file “UtopiaPopulation.csv”.
To extract the three most common crimes, we only need to consider in “UtopiaCrimes.csv”
Crimes <- read.csv( "UtopiaCrimes.csv" )
We then count the number of incidents for each crime category
Crimes %>%
count( Category, sort=TRUE ) %>%
slice_head( n=3 )
## Category n
## 1 Burglary 16513
## 2 Drug Possession 10551
## 3 Assault 10169
We find that burglaries, drug possession and assault were the most common crimes for 2015-2021.
Given the results in part a), we have to create a map for burglaries. We start by calculating the number of burglaries for each district:
Burglaries <- Crimes %>%
filter( Category == "Burglary" ) %>%
count( District )
We then derive the number of cases per 1,000 population for each district
Population <- read.csv( "UtopiaPopulation.csv" )
Burglaries <- Burglaries %>%
inner_join( Population, by="District" ) %>%
mutate( Rate = 1000 * n / Population )
We next load the shape file
Utopia <- read_sf( "Shapefiles/UtopiaShapefile.shp" )
Finally, we add the calculated rate of burglaries to the data frame provided by the shapefile and visualize the lattice data
Utopia <- Utopia %>% inner_join( Burglaries, by=c("NAME_1"="District") )
ggplot( Utopia, aes(fill=Rate) ) + geom_sf() + theme_bw() +
labs( title="Rate of Burglaries per 1000 people",
x="Longitude", y="Latitude", fill="Rate" )
The plot reveals that northern / north-western Utopia is worst affected in terms of the rate of burglaries.
To address this question, we have to perform point pattern analysis. We start by filtering out the observations of drug possession for District 44 and we only consider the two most recent years:
District44_Drugs <- filter( Crimes, Category == "Drug Possession",
District == "District 44", Year>=2020 )
We next extract the shapefile for District 44 and define a ppp object. We introduced two methods and the code for both is provided below. Option 1 is to use the approach from Section 4.6.2 in the lecture notes, but here we need to use the function st_reserve() to make it work:
District44 <- filter( Utopia, NAME_1 == "District 44" )
District44 <- st_reverse( District44 )
District44_ppp <- ppp( District44_Drugs$Longitude, District44_Drugs$Latitude,
poly = District44$geometry[[1]][[1]] )
The second option is to employ the approach in Section 4.6.4 - the approaches give a different scale in terms of values, but the conclusions are identical (so you use whichever you prefer):
Drugs_transformed <- District44_Drugs %>%
st_as_sf( coords=c("Longitude","Latitude"), crs=4326 ) %>%
st_transform( crs=3857 ) %>%
st_coordinates( )
District44 <- District44 %>% st_transform( crs=3857 )
District44_ppp <- ppp( Drugs_transformed[,1],
Drugs_transformed[,2],
window = as.owin(District44) )
Finally, we estimate and plot the kernel smoothed intensity function to identify the area with the highest intensity
par( mai=c(0.01,0.01,0.5,0.01) )
plot( density.ppp(District44_ppp, edge = TRUE),
main="Drug Possession across District 44" )
The plot suggests that the police should focus on the centre of District 44 since this is the area with the highest intensity. There is also a second area in the north-east of District 44 where a raid may also be quite successful.
As many other countries around the world, the United States have seen high inflation over the past years. In the following we want to analyze socioeconomic data for the state of Texas. We have access to two data files:
“Cost_Texas.csv” gives the cost for essentials (food, rent, healthcare,etc.) in 2023 for a single household for each county in Texas based on the Family Budget Calculator by the Economic Policy Institute.
“Income_Texas.csv” gives the median income per capita for each county in Texas as reported by the Bureau of Economic Analysis of the US government for 2021.
Perform the following tasks and answer the research questions:
We start by loading the two data sets
Cost_Texas <- read.csv("Cost_Texas.csv", fileEncoding = "UTF-8")
Income_Texas <- read.csv("Income Texas.csv", fileEncoding = "UTF-8")
A look at the data reveals that the numbers on income include a comma, which prevents this variable from being recognized as a numerical variable. So we have to do some data cleaning first
Income <- Income_Texas %>%
mutate( Income.per.capita = gsub("\\,","",Income.per.capita) ) %>%
mutate( Income.per.capita = as.numeric(Income.per.capita) )
We can now merge the two data sets based on county names:
Texas_money <- inner_join( Income, Cost_Texas, by=c("County"="county") )
Finally, we can calculate the difference between income and cost as required:
Texas_money <- Texas_money %>% mutate( Diff = Income.per.capita - total_cost )
We first load the shapefile for Texas
load( "Shapefiles/Texas.RData" )
We can now attach the information retrieved in part a) to the data frame in Texas
TX <- inner_join( x=Texas, y=Texas_money, by="County" )
The map can then, for instance, be created with ggspatial
ggplot( ) +
annotation_map_tile( zoom=5 ) +
geom_sf( data=TX, aes(fill=Diff) ) + theme_bw() +
scale_fill_distiller( palette="Reds", trans="reverse" ) +
theme( axis.title=element_text(size=15), axis.text=element_text(size=15) ) +
labs( x="Longitude", y="Latitude", fill="Income - Cost" )
We see quite a difference in values ranging from values around zero (the smallest value is about -1,900$), to over $80,000. So we would conclude that there are counties where the income of a single household is pretty much entirely spent on the essential costs, while there are some counties where people can continue saving.
There are at least two aspects that may affect reliability. The first is that the income and cost data are not for the same year. So we should not conclude that the values are fully accurate, but they give some indications on which we can comment with some certainty. The second aspect is that the income data is for people working in that area, while the cost data is representative for people living in the area. Let’s retrieve the two counties with the highest difference:
slice_max( Texas_money, Income.per.capita, n=2 )
## County Income.per.capita isMetro total_cost Diff
## 1 Midland 126738 TRUE 43795.24 82942.76
## 2 Glasscock 124963 FALSE 38365.05 86597.95
We find that that the two counties with the highest median income are Midland and Glasscock, and these counties also have a difference above $80,000. Looking at Wikipedia, these two counties are centres of the oil industry in Texas. One feature of these jobs is that the people working in the oil industry are not necessarily living in that area permanently. Consequently, our analysis for these counties may biased due to this feature.
The grey squirrel is classified as an invasive species in the UK, and it has displaced the native red squirrel across large parts of the UK. A wildlife conservation charity has collected data on reported sightings of grey squirrels for 2020-2022. The data they provided includes the following:
GreySquirrels.csv: Sightings of grey squirrels reported to the wildlife conservation charity for 2020-2022:
Year: Year the sighting happened
Lon: Longitude coordinate of the sighting
Lat: Latitude coordinate of the sighting
Admin: Name of the administrative area, where the grey squirrel was sighted
UK Shapefile: Folder containing shapefiles for the UK
UK.shp: Shapefile without administrative boundaries
UK admin.shp: Shapefile with boundaries for the administrative areas named in the file “GreySquirrels.csv”
The charity assured us that the data is representative of the spatial distribution of squirrels across Great Britain for all years. They ask you to use the data to investigate the following aspects:
What can we say about the spatial distribution of grey squirrels across Great Britain in 2022?
Are there any areas of Great Britain that saw a notable change in the number of grey squirrels when we compare the data for 2020 and 2022?
We start by loading the data and the shapefiles
UK <- read_sf( "UK Shapefile/UK.shp" )
UK_county <- read_sf( "UK Shapefile/UK_admin.shp" )
Squirrels <- read.csv( "GreySquirrels.csv" )
To explore the spatial distribution of the grey squirrel for 2022, we extract the relevant observations
Squirrels2022 <- filter( Squirrels, Year == 2022 )
Since we are working with point pattern data, we use techniques for analyzing point pattern data and create a ppp object
UK <- st_reverse( UK )
Squirrels2022_ppp <- ppp( Squirrels2022$Lon, Squirrels2022$Lat, poly = UK$geometry[[1]][[1]] )
## Warning: 11 points were rejected as lying outside the specified window
Note: Some of the points are lying outside the area, because we only took the largest polygon. The omission of these few points will not affect our results due to there being about 10,800 observations in the data.
We can now use the smoothed kernel intensity function to explore the spatial distribution
lambdaC <- density.ppp( Squirrels2022_ppp, edge=TRUE, sigma=0.1 )
plot( lambdaC, main="Grey Squirrels across Great Britain" )
The plot suggests that the highest density of grey squirrels can be found in the East Midlands, with a high intensity observed throughout England (except for Cornwall and Devon). On the other hand, we hardly observe any squirrels in West Wales and Scotland.
To measure the change in numbers, one option is to focus on the administrative areas and calculate the number of grey squirrels for 2020 and 2022 for each of them
Squirrels_county <- Squirrels %>%
count( Year, County ) %>%
filter( Year %in% c(2020,2022) ) %>%
rename( Count = n ) %>%
ungroup()
Let’s visualize the counts within each administrative boundary
Squirrels2022 <- filter( Squirrels_county, Year == 2022 )
UK_county <- full_join( UK_county, Squirrels2022, by=c("NAME_2"="County") )
ggplot( UK_county, aes(fill=Count) ) +
annotation_map_tile(zoom=5) +
theme_bw() + geom_sf() +
scale_fill_distiller( palette="Reds", trans="reverse" )
We see that North Yorkshire had the highest count in 2022, although it was not the area with the highest intensity - these differences are due to the different areas sizes.
The next step is to calculate the raw change in numbers for each administrative boundary
Squirrels_county <- Squirrels_county %>%
pivot_wider( names_from = Year, values_from = Count ) %>%
mutate( `2020` = replace_na(`2020`,0), `2022` = replace_na(`2022`,0) ) %>%
mutate( Change = `2022` - `2020` )
Finally, we visualize the calculated numbers:
UK_county <- UK_county %>% full_join( Squirrels_county, by=c("NAME_2"="County") )
ggplot( UK_county, aes(fill=Change) ) +
annotation_map_tile(zoom=5) +
geom_sf() + theme_bw() +
scale_fill_distiller( palette="Reds", trans="reverse" )
We find that the areas that had the largest change in raw numbers are all located in northern England, in particular Cumbria saw a big increase in the number of grey squirrels.